We now have a function that can predict the price for any living space we want to list as long as we know the number of people it can accommodate. The function we wrote represents a machine learning model, which means that it outputs a prediction based on the input to the model.
A simple way to test the quality of your model is to split the dataset into two partitions (a training set and a test set), use the rows in the training set to predict the price values for the rows in the test set, and then compare the predicted values with the actual price values to see how accurate the predictions were.
This validation process, where we use the training set to make predictions for the rows in the test set, is known as train/test validation. Whenever you're performing machine learning, you want to perform validation of some kind to ensure that your machine learning model can make good predictions on new data. While train/test validation isn't perfect, we'll use it to understand the validation process and to select an error metric, and then we'll dive into a more robust validation process later in this course.
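One aside worth noting (not part of the code in this mission): because rows can be ordered in some systematic way, it's common to shuffle a dataset before splitting it. Here's a minimal sketch of that approach, assuming dc_listings has been read in with pandas as shown below; the names shuffled_train and shuffled_test are just for illustration.
# aside (not part of this mission's code): shuffle before splitting so the
# train/test partitions aren't ordered by how the data was collected
shuffled = dc_listings.sample(frac=1, random_state=1).reset_index(drop=True)
split_index = int(len(shuffled) * 0.75)
shuffled_train = shuffled.iloc[:split_index]
shuffled_test = shuffled.iloc[split_index:]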
Let's modify the predict_price function to use only the rows in the training set, instead of the full dataset, to find the nearest neighbors, average the price values for those rows, and return the predicted price value. Then, we'll use this function to predict the price for just the rows in the test set. Once we have the predicted price values, we can compare them with the true price values and start to understand the model's effectiveness in the next screen.
To start, we've gone ahead and assigned the first 75% of the rows in dc_listings to train_df and the last 25% of the rows to test_df. Here's a diagram explaining the split:
[Diagram: dc_listings is split into train_df (the first 75% of rows) and test_df (the remaining 25% of rows).]
In [3]:
import pandas as pd
import numpy as np
In [4]:
# loading data
dc_listings = pd.read_csv("dc_airbnb.csv")
# preparing data, stripping commas and currency symbols
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
# separate data into train and test sets (75%/25%)
train_df = dc_listings.iloc[0:2792].copy()
test_df = dc_listings.iloc[2792:].copy()
def predict_price(new_listing):
    # work on a copy of the training set so we don't modify train_df
    temp_df = train_df.copy()
    # distance = absolute difference in the number of people accommodated
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    # average the price of the 5 nearest neighbors
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbor_prices.mean()
    return predicted_price
In [12]:
predicted_prices = test_df['accommodates'].apply(predict_price)
test_df['predicted_price'] = predicted_prices
In [13]:
test_df
Out[13]:
We now need a metric that quantifies how good the predictions were on the test set. This class of metrics is called an error metric. As the name suggests, an error metric quantifies how inaccurate our predictions were compared to the actual values. In our case, the error metric tells us how far off our predicted price values were from the actual price values for the living spaces in the test dataset.
We could start by calculating the difference between each predicted and actual value and then averaging these differences. This is referred to as the mean error, but it isn't an effective error metric in most cases. The mean error treats a positive difference differently than a negative difference, while we're really interested in how far off the prediction is in either direction: if the true price is 200 dollars and the model predicts 210 or 190, it's off by 10 dollars either way.
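To see why that's a problem, here's a small illustration with made-up error values (not from our test set): the positive and negative differences cancel out, so the mean error looks perfect even though every prediction is off by 10 or 20 dollars.
# illustration with made-up error values (not from our test set):
# positive and negative differences cancel out in the mean error
example_errors = pd.Series([10, -10, 20, -20])
print(example_errors.mean())  # prints 0.0, even though no prediction was accurate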
We can instead use the mean absolute error, where we compute the absolute value of each error before we average all the errors.
$\displaystyle MAE = \frac{\left | actual_1 - predicted_1 \right | + \left | actual_2 - predicted_2 \right | + \ \ldots + \left | actual_n - predicted_n \right | }{n}$
In [14]:
mae = np.absolute(test_df['predicted_price'] - test_df['price']).mean()
mae
Out[14]:
For many prediction tasks, we want to penalize predicted values that are further away from the actual value much more than those that are closer to the actual value.
We can instead take the mean of the squared error values, which is called the mean squared error, or MSE for short. Squaring penalizes large gaps between the predicted and actual values much more heavily: a prediction that's off by 100 dollars contributes an error of 10,000, which is 100 times more than a prediction that's off by only 10 dollars (an error of 100).
Here's the formula for MSE:
$\displaystyle MSE = \frac{(actual_1 - predicted_1)^2 + (actual_2 - predicted_2)^2 + \ \ldots + (actual_n - predicted_n)^2 }{n}$
where n represents the number of rows in the test set. Let's calculate the MSE value for the predictions we made on the test set.
In [15]:
mse = (test_df['predicted_price'] - test_df['price'])**2
mse = mse.sum() / len(test_df)
mse
Out[15]:
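As an optional cross-check (a sketch that assumes scikit-learn is installed; this mission computes the metrics by hand), the same values can be obtained from scikit-learn's metrics helpers:
# optional cross-check, assuming scikit-learn is installed
# (not used elsewhere in this mission)
from sklearn.metrics import mean_absolute_error, mean_squared_error
print(mean_absolute_error(test_df['price'], test_df['predicted_price']))
print(mean_squared_error(test_df['price'], test_df['predicted_price']))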
The model we trained achieved a mean squared error of around 18646.5. Is this a high or a low mean squared error value? What does this tell us about the quality of the predictions and the model? By itself, the mean squared error value for a single model isn't all that useful.
The units of mean squared error in our case are dollars squared (not dollars), which also makes it hard to reason about intuitively. We can, however, train another model and then compare the two mean squared error values to see which model performs better on a relative basis. Recall that a low error metric means the gap between the predicted and actual list price values is small, while a high error metric means the gap is large.
Let's train another model, this time using the bathrooms column, and compare MSE values.
In [16]:
train_df = dc_listings.iloc[0:2792].copy()
test_df = dc_listings.iloc[2792:].copy()
def predict_price(new_listing):
    # work on a copy of the training set so we don't modify train_df
    temp_df = train_df.copy()
    # distance = absolute difference in the number of bathrooms
    temp_df['distance'] = temp_df['bathrooms'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    # average the price of the 5 nearest neighbors
    nearest_neighbors_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbors_prices.mean()
    return predicted_price
In [18]:
predicted_price = test_df['bathrooms'].apply(predict_price)
test_df['predicted_price'] = predicted_price
In [19]:
squared_error = (test_df['predicted_price'] - test_df['price'])**2
test_df['squared_error'] = squared_error
In [21]:
test_df.head(1)
Out[21]:
In [20]:
mse = test_df['squared_error'].mean()
print(mse)
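Rather than rewriting predict_price for every feature we want to try, one possible design choice is to parameterize the feature column and compare MSE values in a loop. The sketch below isn't part of the original walkthrough; it assumes the train_df and test_df splits defined above, and the helper names predict_price_for and feature_mse are just for illustration.
# sketch (not part of the original walkthrough): parameterize the feature
# column so the same nearest-neighbors logic can score any single feature
def predict_price_for(feature, new_value, k=5):
    temp_df = train_df.copy()
    temp_df['distance'] = (temp_df[feature] - new_value).abs()
    temp_df = temp_df.sort_values('distance')
    return temp_df.iloc[0:k]['price'].mean()

def feature_mse(feature):
    predictions = test_df[feature].apply(lambda v: predict_price_for(feature, v))
    return ((predictions - test_df['price']) ** 2).mean()

for feature in ['accommodates', 'bathrooms']:
    print(feature, feature_mse(feature))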
While comparing MSE values helps us identify which model performs better on a relative basis, it doesn't help us understand whether the performance is good enough in general. This is because the units of the MSE metric are squared (in this case, dollars squared). An MSE value of 16377.5 dollars squared doesn't give us an intuitive sense of how far, in dollars, the model's predictions typically are from the true price values.
Root mean squared error, or RMSE for short, is an error metric whose units are the base unit of the target (in our case, dollars). It's calculated by taking the square root of the MSE value:
$\displaystyle RMSE=\sqrt{MSE}$
Since the RMSE value uses the same units as the target column, we can understand how far off, in real dollars, we can expect the model's predictions to be. For example, if a model achieves an RMSE value of 100, we can expect its predicted price values to be off by roughly 100 dollars on a typical prediction.
Let's calculate the RMSE value of the model we trained using the bathrooms column.
In [22]:
rmse = np.sqrt(mse)
rmse
Out[22]:
The model achieved an RMSE value of approximately 135.6, which implies that we should expect the model's predicted price values to be off by about 135.6 dollars on average. Given that most of the living spaces are listed at just a few hundred dollars, we need to reduce this error as much as possible to improve the model's usefulness.
We discussed a few different error metrics we can use to understand a model's performance. As we mentioned earlier, these individual error metrics are helpful for comparing models. To better understand a specific model, we can compare multiple error metrics for the same model. This requires a better understanding of the mathematical properties of the error metrics.
If you look at the equation for MAE:
$\displaystyle MAE = \frac{\left | actual_1 - predicted_1 \right | + \left | actual_2 - predicted_2 \right | + \ \ldots + \left | actual_n - predicted_n \right | }{n}$
you'll notice that the individual errors (the differences between the predicted and actual values) grow linearly: a prediction that's off by 10 dollars contributes 10 times more error than a prediction that's off by 1 dollar. If you look at the equation for RMSE, however:
$\displaystyle RMSE = \sqrt{\frac{(actual_1 - predicted_1)^2 + (actual_2 - predicted_2)^2 + \ \ldots + (actual_n - predicted_n)^2 }{n}}$
you'll notice that each error is squared before the errors are averaged and the square root is taken. This means that individual errors grow quadratically, so large errors have an outsized effect on the final RMSE value.
Let's look at an example using different data entirely. We've created two Series objects containing two sets of errors and assigned them to errors_one and errors_two (shown in the cell below).
In [26]:
errors_one = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10])
errors_two = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 1000])
mae_one = sum(errors_one) / len(errors_one)
rmse_one = np.sqrt(sum(errors_one**2) / len(errors_one))
print('ERRORS_ONE: ', end='')
print('MAE: %.2f ; RMSE: %.2f ; Ratio: %.2f' % (mae_one, rmse_one, mae_one / rmse_one))
mae_two = sum(errors_two) / len(errors_two)
rmse_two = np.sqrt(sum(errors_two**2) / len(errors_two))
print('ERRORS_TWO: ',end='')
print('MAE: %.2f ; RMSE: %.2f ; Ratio: %.2f' % (mae_two, rmse_two, mae_two / rmse_two))
While the MAE (7.5) to RMSE (about 7.91) ratio was about 1:1 for the first set of errors, the MAE (62.5) to RMSE (about 235.82) ratio was closer to 1:4 for the second set. The only difference between the two sets of errors is the extreme value of 1000 in errors_two in place of 10. When we're working with larger datasets, we can't inspect each value to understand whether there are a few outliers or whether all of the errors are systematically high. Looking at the ratio of MAE to RMSE can help us understand if there are large but infrequent errors. You can read more about comparing MAE and RMSE in this wonderful post.
In this mission, we learned how to test our machine learning models using basic train/test validation and different error metrics. In the next two missions, we'll explore how adding more features to the machine learning model and selecting a more optimal k value can help improve the model's performance.